Running head: LATENT SEMANTIC ANALYSIS AND KNOWLEDGE ASSESSMENT

Using Latent Semantic Analysis to assess knowledge: Some technical considerations
Abstract
In a previous paper (Wolfe, Schreiner, Rehder, Laham, Foltz, Landauer, & Kintsch, this issue) we have shown how Latent Semantic Analysis (LSA) can be used to assess student knowledge, how essays can be graded by LSA, and how LSA can match students with appropriate instructional texts. We did this by comparing an essay written by a student with one or more target instructional texts in terms of the cosine between the vector representation of the student's essay and the instructional text in question. This simple method was effective for the purpose, but questions remain about how LSA achieves its results and how they might be improved. Here we address four such questions: (a) what role the use of technical vocabulary per se plays, (b) how long student essays should be, (c) whether the cosine is the optimal measure of semantic relatedness, and (d) how to deal with the directionality of knowledge in the high-dimensional space.

Using Latent Semantic Analysis to assess knowledge: Some technical considerations

The Role of Technical Terms

The semantic relatedness between a student's essay on a certain topic and an instructional text in that domain proved to be a reliable measure of student knowledge and a valuable predictor of how much the student could learn from the text. What features of the essays were responsible for these properties? Specifically, what role does the technical vocabulary play? Does LSA owe its success merely to the students' use of technical vocabulary? If a student had generated an unstructured "bag of technical words" that did not really reflect high-level understanding (e.g., a well-developed situation model), would LSA have done equally well? To investigate this question we re-analyzed data from Wolfe et al. (see that paper for the necessary details) as follows (a schematic sketch of the procedure is given after Table 1):

(1) A list of technical heart/circulatory terms was developed using a loose criterion (e.g., "left" and "right" were counted as non-technical terms; "pump", "body", "purple" and "red" were counted as technical terms, as were more sophisticated terms, such as "superior", "bulbo" and "spiral").

(2) The participants' original (Original) pre- and post-learning essays were separated into technical terms (Technical) and other words (Non-Technical). Of the words the students wrote on which LSA bases its analysis, 47.9% were classified as technical.

(3) Two new LSA vectors for each student's essay were computed, one based on the technical words only and one based on the non-technical words only.

(4) Cosines between the Technical and Non-Technical essay vectors and Text C (in addition to the original, whole-essay cosines obtained by Wolfe et al.) were computed.

(5) The pattern of correlations between the three types of cosine measures (i.e., Original, Technical, and Non-Technical) and the students' performance on the pre-questionnaire test was compared. (Recall that the pre-questionnaire test was a 40-item short-answer test, taken before writing an essay and reading the instructional text.)

The correlations between the pre-questionnaire scores and the three types of cosine measures for all 106 participants in the Wolfe et al. study (94 undergraduates and 12 medical students) are given in Table 1; all correlations are significant at the p < .0001 level.

                   Pre-Questionnaire   Original   Non-Technical
Original                 .71
Non-Technical            .59              .83
Technical                .69              .94          .63

Table 1. Correlations between pre-questionnaire scores and the three cosine measures.
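To make the re-analysis concrete, the following Python sketch outlines steps (1)-(5) under stated assumptions; it is not the code used in the study. The helper lsa_vector() (which would project a bag of words into the heart LSA space), the set TECHNICAL_TERMS (the loose technical-term list of step 1), and text_c_vec (the vector for Text C) are hypothetical stand-ins for the actual materials.

    # Illustrative sketch of the technical / non-technical re-analysis.
    # Assumed (hypothetical) inputs: lsa_vector(words) -> numpy vector in the heart
    # LSA space; TECHNICAL_TERMS: the loosely defined technical term set of step (1);
    # essays: tokenized student essays; pre_q: matching pre-questionnaire scores;
    # text_c_vec: the LSA vector of the standard instructional text, Text C.
    import numpy as np

    def cosine(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def split_essay(words, technical_terms):
        technical = [w for w in words if w.lower() in technical_terms]
        non_technical = [w for w in words if w.lower() not in technical_terms]
        return technical, non_technical

    def cosines_with_text_c(essays, technical_terms, text_c_vec, lsa_vector):
        original, technical, non_technical = [], [], []
        for words in essays:
            tech, non_tech = split_essay(words, technical_terms)
            original.append(cosine(lsa_vector(words), text_c_vec))
            technical.append(cosine(lsa_vector(tech), text_c_vec))
            non_technical.append(cosine(lsa_vector(non_tech), text_c_vec))
        return np.array(original), np.array(technical), np.array(non_technical)

    # Step (5): correlate each set of cosines with the pre-questionnaire scores,
    # e.g., r_orig = np.corrcoef(original, pre_q)[0, 1]

With the three arrays of cosines in hand, step (5) amounts to correlating each of them with the pre-questionnaire scores, as in the final commented line.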
It is true, as one might have suspected, that cosines computed from essay vectors that contain only a list of the technical terms the students used correlated about as highly with the pre-questionnaire scores as cosines computed from the intact essays. Thus, the technical vocabulary a student uses makes a difference for LSA, and, indeed, an appropriately weighted count of technical word use in an essay can serve as an effective measure. What is surprising, however, about the correlations in Table 1 is that cosines computed from the non-technical words in the essays yielded almost equally good predictions as the cosines computed from the technical vocabulary! The non-technical words students use in describing the functioning of the heart contain a great deal of information about their knowledge of the heart, indeed almost as much as their technical vocabulary or the essays as a whole. We conclude that nothing is to be gained by separating essays into technical and non-technical terms (which is neither easy nor straightforward).

What if we had students simply generate a list of technical, heart-related terms, instead of writing an essay, and then used the vector representation of that list in the LSA space as our estimate of heart knowledge? Table 1 suggests that such a procedure might be effective, although one cannot simply equate such a list with the technical terms we extracted from the students' essays. It is not clear that the same or even a similar list would be generated by the two procedures, given the very different task demands involved (e.g., our list contains repetitions). Perhaps the only way that a student can generate an accurate set of words is to compose a good essay. This conjecture requires further research.

Essay Word Count

In the Wolfe et al. experiment, participants were instructed to write an essay of approximately 250 words. Nevertheless, there was a fair amount of variability in essay word count. The mean length of the complete pre-essays was 261.2 words (s.d. = 15.02), with word counts ranging from 209 words to 306 words. The correlation between essay word count and pre-questionnaire score was non-significant (r = .12). This result may be due to the fact that the length of participants' essays was constrained (to about 250 words). In studies where essay length is less constrained, essay word count is strongly related to knowledge. For example, Laham and Landauer (1996) found that the length of essays written during a class period as part of a psychology test predicted the grade that the essay received. Page (1994) has also found essay word count to be a strong predictor of domain knowledge.

How long should an essay be in order to get an accurate estimate of how much a participant knows? To answer this question, we looked at the effectiveness of the LSA cosine measure in predicting pre-questionnaire scores as a function of essay word count. From each participant's essay we created 19 sub-essays: the first sub-essay consisted of the first 10 words, the second sub-essay consisted of the first 20 words, and so on up to 200 words (up to the minimum essay word count). We then calculated cosines between each sub-essay and the standard instructional text, Text C. Next we calculated the correlation between the cosines for the sub-essays of a given length (e.g., first all 106 10-word sub-essays, then all 106 20-word sub-essays, etc.) and the participants' pre-questionnaire scores. A sketch of this truncation analysis is given below.
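The truncation analysis just described can be sketched as follows; as before, lsa_vector() and text_c_vec are hypothetical stand-ins for the heart LSA space and the Text C vector, and the code is illustrative rather than the original analysis.

    # Sketch of the truncated sub-essay analysis. essays: tokenized essays,
    # pre_q: pre-questionnaire scores, text_c_vec and lsa_vector as assumed above.
    import numpy as np

    def cosine(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    def truncation_curve(essays, pre_q, text_c_vec, lsa_vector,
                         lengths=range(10, 201, 10)):
        """For each truncation length, correlate the sub-essay/Text C cosines
        with the pre-questionnaire scores and return the r-squared values."""
        r_squared = []
        for n in lengths:
            cosines = [cosine(lsa_vector(words[:n]), text_c_vec) for words in essays]
            r = np.corrcoef(cosines, pre_q)[0, 1]
            r_squared.append(r ** 2)
        return list(lengths), r_squared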
The proportion of the variance accounted for by each of the 19 sub-essay/pre-questionnaire correlations is given in Figure 1.

Figure 1. The proportion of variance accounted for (r^2) when predicting pre-questionnaire scores from the cosines of students' essays and Text C as a function of truncated essay word count.

The first 60 words of essays are non-predictive of knowledge level, at least under the conditions of the present study, where students were instructed to write 250-word essays. Between 70 and 200 words the cosine of the essay becomes increasingly predictive of the participants' knowledge level, but with decreasing marginal returns. Thus, the accuracy gained in measuring domain knowledge with essays considerably longer than 200 words may be negligible. Given the practical difficulties in gathering essays from students, 200-word essays appear to be a reasonable compromise.

Alternative LSA-Derived Measures of Knowledge

In previous LSA work, the cosine between two document vectors has been the primary measure of the similarity between two documents. However, there are other possible LSA-derived measures that could be used, such as the dot product or Euclidean distance between the two vectors, or the length of an individual vector. In this section, we explore the usefulness of these alternatives as measures of knowledge in a domain.

A vector can be thought of as a position within an n-dimensional space. The value of a vector is represented as a series of coefficients, each coefficient representing a value (or distance) along a particular dimension in the n-dimensional space. Thus, if X and Y are vectors in an n-dimensional space, then X and Y are written as:

X = (x1, x2, ..., xn) and Y = (y1, y2, ..., yn)

The inner product, or dot product, of vectors X and Y is defined as:

X•Y = x1y1 + x2y2 + ... + xnyn

It is important to note that X•Y is a scalar, not a vector. The length of a vector X is defined as:

||X|| = sqrt(X•X) = sqrt(x1^2 + x2^2 + ... + xn^2)     (1)

If we use the symbol θ to denote the angle between vectors X and Y, then the cosine of θ is defined as:

cos θ = (X•Y) / (||X|| ||Y||)

The value of θ can range between 0 and π; cos θ can range between -1 and 1. As it is used in LSA, the cosine between two document vectors roughly refers to the similarity of the document vectors while factoring out the length of the vectors. As we apply it to the Wolfe et al. (this issue) data, the cosine between the vector of an individual's pre-essay about the heart (E) and the vector of a standard instructional heart text is considered as a measure of pre-knowledge about the heart. We use the vector for Text C (see footnote 1) as the standard instructional text, so we refer to this cosine measure as cos EC. How well cos EC serves as a measure of knowledge may be assessed by correlating it with independent knowledge measures. The dot product between an essay's vector and the vector for Text C, E•C, the Euclidean distance between the essay's vector and Text C's vector, dist EC, and the length of the essay's vector itself, ||E||, are also considered as candidate knowledge measures. For these analyses we use the data of the 94 undergraduates from Wolfe et al. (this issue). A computational rendering of these four measures is given below.
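For concreteness, here is a minimal computational rendering of the four candidate measures (cosine, dot product, Euclidean distance, and vector length) for an essay vector e and the Text C vector c, assuming both are available as numpy arrays of equal dimensionality.

    # The four LSA-derived measures discussed above, for two document vectors:
    # e (an essay) and c (Text C), given as numpy arrays of equal length.
    import numpy as np

    def dot_product(e, c):          # E•C, a scalar
        return float(np.dot(e, c))

    def length(e):                  # ||E|| = sqrt(E•E)
        return float(np.linalg.norm(e))

    def cosine(e, c):               # cos EC = E•C / (||E|| ||C||), between -1 and 1
        return dot_product(e, c) / (length(e) * length(c))

    def euclidean_distance(e, c):   # dist EC
        return float(np.linalg.norm(e - c))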
Cos EC, E•C, dist EC, and ||E|| are each individually highly significant predictors of pre-knowledge as measured by the pre-questionnaire and the pre-essay grades. These correlations are presented in Table 2 (the dim-method variables will be discussed in the following section). The largest correlations are found using the dot product, E•C, rather than with cos EC, a surprising result in light of the superiority of cosines as a similarity measure found in the LSA and LSI (latent semantic indexing) literature (Harman, 1986).

                 Pre-questionnaire   Pre-essay
cos EC                 0.68             0.62
E•C                    0.76             0.73
dist EC               -0.72            -0.69
||E||                  0.65             0.65
dim-method 1           0.67             0.62
dim-method 2           0.70             0.63
dim-method 3           0.83             0.72

Table 2: Correlations of pre-knowledge assessment scores and LSA measures. All p values < .0001, n = 94.

Because cos EC, E•C, dist EC, and ||E|| are all correlated with one another, an obvious next step is to perform a multiple regression using all four variables as predictors. Before proceeding, however, it is instructive to consider the mathematical relations between these variables. For example, the dot product is given by

E•C = (cos EC)(||E||)(||C||)

where ||C|| is the length of the vector of Text C. Because ||C|| is constant, a new variable, E•C' = (cos EC)(||E||), will have the same correlations with the pre-knowledge measures as does E•C. In other words, for purposes of predicting pre-knowledge scores, E•C may be viewed as a function of cos EC and ||E||. In particular, E•C may be interpreted as the interaction term between cos EC and ||E||. In the Appendix we prove that predicting pre-knowledge measures with dist EC^2, a monotonic transformation of the Euclidean distance dist EC, is equivalent to predicting pre-knowledge from a linear combination of (cos EC)(||E||) and ||E||^2.

The fact that E•C is a function of (cos EC)(||E||), and that dist EC is a function of (cos EC)(||E||) and ||E||^2, suggests a multiple regression equation in which (cos EC)(||E||) and ||E||^2 are used as predictors of pre-knowledge scores along with cos EC and ||E||. The results of such a multiple regression predicting pre-questionnaire scores are shown in Table 3 (a sketch of this regression follows the table). Cos EC and ||E|| are highly significant predictors above and beyond all other variables, including each other. The interaction term (i.e., a scaled version of the dot product) was not. In other words, although the dot product was the most successful individual LSA measure for predicting pre-knowledge, it provides no additional predictive value above and beyond cos EC and ||E|| together. The failure of the interaction term to reach significance also reveals that cos EC and ||E|| are independent contributors to representing domain knowledge. Finally, the two terms for Euclidean distance, (cos EC)(||E||) and ||E||^2, provide no additional predictive value above and beyond cos EC and ||E|| together. Thus, we conclude that cos EC and ||E|| exhaust the representation of knowledge embedded in the four LSA variables we originally considered in Table 2.

Predictor            Partial Correlation   Standardized Beta Weight   F(1,93)
cos EC                      .53                     0.46              35.4 ***
||E||                       .51                     0.43              32.1 ***
(cos EC)(||E||)             .08                    -0.09                <1
||E||^2                     .03                    -0.03                <1

Table 3: Results of a multiple regression in which pre-questionnaire scores are predicted from cos EC, ||E||, (cos EC)(||E||), and ||E||^2. Multiple R^2 = 0.61. *** = p < .0001.
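A minimal sketch of the Table 3 regression is given below, assuming the per-participant values cos_ec, norm_e (i.e., ||E||), and pre_q are available as numpy arrays. The original analysis may well have been carried out with different statistical software; this sketch reports only the fitted coefficients and the multiple R^2, not the partial correlations or F tests.

    # Sketch of the Table 3 regression: pre-questionnaire scores predicted from
    # cos EC, ||E||, (cos EC)(||E||), and ||E||^2, via ordinary least squares.
    import numpy as np

    def regression_r_squared(cos_ec, norm_e, pre_q):
        # Design matrix with an intercept column plus the four predictors.
        X = np.column_stack([
            np.ones_like(cos_ec),   # intercept
            cos_ec,                 # cos EC
            norm_e,                 # ||E||
            cos_ec * norm_e,        # interaction term (a scaled dot product E•C)
            norm_e ** 2,            # ||E||^2 (from the Euclidean-distance decomposition)
        ])
        beta, residuals, rank, _ = np.linalg.lstsq(X, pre_q, rcond=None)
        fitted = X @ beta
        ss_res = np.sum((pre_q - fitted) ** 2)
        ss_tot = np.sum((pre_q - pre_q.mean()) ** 2)
        return beta, 1.0 - ss_res / ss_tot   # coefficients and multiple R^2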
The fact that cos EC is a highly significant predictor of pre-knowledge is expected, as cos EC reflects the direction of an essay's vector in the high-dimensional LSA space, and the vector's direction is interpreted as the representation of the quality of the semantic content of the essay. In order to interpret the finding that essay vector length, ||E||, is a highly significant predictor of pre-knowledge above and beyond cos EC, we must consider what LSA spaces are and how LSA vectors are computed in an LSA space.

An LSA space is intended to represent a multi-dimensional semantic space. As the degree of association with one or more semantic dimensions increases, the length of an essay's vector increases. A number of factors influence how the semantic associations represented by an essay's vector are determined. First, the LSA space we used was constructed from encyclopedia articles only about the heart. As a result, words that are "off-topic" (not about the heart) do not appear in the LSA matrix, and hence cannot affect an essay's vector representation, including its length. Second, words that appear rarely in the heart encyclopedia articles (such as technical words) are more heavily weighted by LSA than words that appear more frequently, under the assumption that rare words are more likely to distinguish documents from one another semantically. Heavily weighted words increase the degree of semantic association relative to less heavily weighted words, resulting in longer vectors. Third, before being converted to their LSA vector representations, essays were first submitted to a "stop list" that removed their non-content words such as "the" and "of" (reducing the size of the essays by an average of 56%; mean = 116, s.d. = 11.15). Fourth, broad essays that are associated with a number of semantic dimensions will have a longer vector length than essays that are written narrowly.

To summarize, an essay's vector length is (a) a strong positive function of the number of rare (often technical) heart words, (b) a moderate positive function of the number of common heart words, (c) a function of the breadth of heart knowledge expressed in the essay, and (d) unrelated to the number of non-content and off-topic words. Thus, the length of an essay's vector reflects an individual's general knowledge about the heart (or, at least, the general knowledge about the heart embedded in the encyclopedia articles). In contrast, the cosine measure reflects the more narrow knowledge embedded in the standard instructional text, Text C. Therefore, knowledge measured both broadly and narrowly is important for predicting performance on the pre-questionnaire employed in this study.

The importance of LSA vector lengths as a representation of domain knowledge needs to be supported by replication in other domains. We can cite one additional study from our laboratory. The Laham and Landauer (1996) study described earlier used an LSA vector representation of essays in the domain of psychology to predict the grades that the essays received, and found vector length to be a highly significant predictor of the grades of the essays above and beyond a cosine measure, and to be more strongly correlated with grade than essay word count. Thus, the importance of vector length in representing knowledge may have some generality.
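The following sketch illustrates, under simplifying assumptions, how the factors listed above enter into an essay's vector and hence its length: the essay vector is formed here as a sum of weighted term vectors, so that stop-listed and off-topic words contribute nothing while heavily weighted rare terms lengthen the vector. The dictionary term_vectors and the set STOP_WORDS are hypothetical stand-ins for the heart LSA space and the stop list actually used, and the additive construction is an assumed simplification rather than a description of the original implementation.

    # Assumed sketch of how an essay's vector (and hence ||E||) comes about:
    # the essay vector is taken as the sum of weighted term vectors for the words
    # that occur in the LSA space. term_vectors: word -> weighted numpy vector
    # (hypothetical); stop_words: the stop list; n_dims: dimensionality of the space.
    import numpy as np

    def essay_vector(words, term_vectors, stop_words, n_dims):
        vec = np.zeros(n_dims)
        for w in words:
            w = w.lower()
            if w in stop_words:
                continue               # non-content words are dropped by the stop list
            if w not in term_vectors:
                continue               # off-topic words are absent from the heart space
            vec += term_vectors[w]     # rare (often technical) words carry larger weights
        return vec

    # ||E|| then grows with the number and weight of on-topic words and with the
    # breadth of semantic dimensions they touch, not with raw essay word count:
    # essay_length = np.linalg.norm(essay_vector(words, term_vectors, STOP_WORDS, 300))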
Essay Word Count versus Essay Vector Length

It is instructive to consider the importance of essay word count in determining the length of an essay's vector. If all essays had the same proportion of rare heart words, common heart words, and non-content and off-topic words, then essay word count would be a strong predictor of essay vector length (because essay word count would predict the number of rare heart words, the number of common heart words, etc.). In fact, the correlation between essay word count and essay vector length is not significant (r = .14). Furthermore, whereas essay vector length is a strong predictor of pre-knowledge (even above and beyond the cosine measure), in a previous section we showed that essay word count was not significantly related to pre-knowledge. For these reasons, we conclude that as essay word count increases, the relative proportions of heart words (rare and common) and of non-content and off-topic words change such that there are proportionally fewer heart words. Thus, those subjects who wrote longer essays did not necessarily possess more heart knowledge, and our use of LSA was able to detect that fact. However, the lack of relationship between essay word count and essay vector length may be due to the fact that participants were constrained to write an essay of 250 words. With less-constrained essays, Laham and Landauer (1996) found that essay word count and vector length were highly correlated (r = .96).

The Goldilocks Principle and the Problem of Directionality in High-Dimensional Space

In Wolfe et al. (this issue) we investigated the hypothesis that learning is optimal when a text is neither too easy nor too hard relative to the learner's background knowledge, a hypothesis we have come to call "the Goldilocks Principle." The cosine between an essay the student wrote before receiving instruction and the instructional text served as an estimate of how difficult that text would be for that student: if the cosine was very high, the text was too easy; if it was very low, the text was too hard; if it was intermediate, the text was just right.

The cosine, however, measures relatedness as an unsigned angle in a high-dimensional space. The essays of two individuals may have the same cosine with an instructional text, but the essay of the first individual may be dissimilar to the text because the individual knows very little about the topic (relative to the text), whereas the essay of the second individual may be dissimilar to the text because the individual knows very much about the topic (relative to the text). Take the hypothetical case of elementary school students and cardiovascular surgeons, all of whom write 250-word essays about the functioning of the human heart. When compared to an undergraduate-level text, these two groups of essays might well have similar cosines. Both are equally dissimilar to the undergraduate text, but for very different reasons: the surgeons' essays are dissimilar because they know significantly more than is contained in the undergraduate text, whereas the elementary school students' essays are dissimilar because they know significantly less. We believe that this problem, which we refer to as the directionality problem, did not materially affect our analysis of the undergraduate data in Wolfe et al. (this issue) because all the undergraduates who participated in the study knew little about the heart relative to the four instructional texts.
As long as all potential learners have considerably less knowledge than all the potentially to-be-read texts, it is clear how the LSA cosine similarity measure can serve as a proxy for a knowledge measure: a higher cosine reflects more knowledge and a lower cosine reflects less knowledge. The directionality problem can be most clearly illustrated with the medical school students who participated in the Wolfe et al. study. The fact that the medical students were relatively high-knowledge and the undergraduates were relatively low-knowledge raises the possibility that the directionality problem might come into play. That is, these two groups might have the same cosines (similarity) with a text, even though one group is above the text and the other below it. Figure 2 presents the distribution of cosines with Text A for the undergraduates and for the medical students. The cosines of the two groups do not differ significantly (t(104) = 1.76, p > .05).

[Figure 2: distributions of cosines with Text A for the undergraduates and the medical students]
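A toy numerical example may help to fix the directionality problem; the vectors below are invented purely for illustration and have no empirical status. The two "essay" vectors lie on opposite sides of the "text" vector and have identical cosines with it, yet their lengths differ sharply, which is exactly the kind of information the cosine alone discards.

    # Toy illustration of the directionality problem: two essay vectors can have
    # the same cosine with an instructional text while lying on opposite sides of
    # it in the space and representing very different amounts of knowledge.
    import numpy as np

    def cosine(x, y):
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    text = np.array([2.0, 2.0])        # stands in for an undergraduate-level text
    novice = np.array([0.5, 1.0])      # little knowledge: short vector, off to one side
    expert = np.array([8.0, 4.0])      # much knowledge: long vector, off to the other side

    print(cosine(novice, text), cosine(expert, text))        # identical cosines (~0.949)
    print(np.linalg.norm(novice), np.linalg.norm(expert))    # lengths differ by a factor of 8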